Abstract

Many sparse linear discriminant analysis (LDA) methods have been proposed to
overcome the major problems of the classic LDA in high-dimensional settings.
However, the asymptotic optimality results are limited to the case of only two
classes, because the classification boundary of LDA is then a hyperplane and
explicit formulas exist for the classification error. When there are more than
two classes, the classification boundary is usually complicated and no explicit
formulas for the classification errors exist. In this paper, we consider the
asymptotic optimality in high-dimensional settings for a large family of linear
classification rules with an arbitrary number of classes under the multivariate
normal distribution. Our main theorem provides easy-to-check criteria for the
asymptotic optimality of a general classification rule in this family as the
dimensionality and the sample size both go to infinity and the number of
classes is arbitrary. We establish the corresponding convergence rates. The
general theory is applied to the classic LDA and to the extensions of two
recently proposed sparse LDA methods to obtain their asymptotic optimality. We
conduct simulation studies on the extended methods in various settings.



Key Words: Linear discriminant analysis; Sparse linear discriminant analysis; Asymptotic 
optimality. 
1 Introduction


As an important classification method, linear discriminant analysis (LDA) performs well
in settings with small p and large n. However, it faces major problems for high-dimensional
data with large p and small n. In theory, Bickel and Levina (2004) and Shao et al. (2011)
showed that in the case of p > n, the classic LDA can be as bad as random guessing.
To address these problems, various regularized discriminant analysis methods have been
proposed, including those described in Friedman (1989), Krzanowski et al. (1995), Dudoit
et al. (2001), Bickel and Levina (2004), Guo et al. (2007), Xu et al. (2009), Tibshirani
et al. (2002), Witten and Tibshirani (2011), Clemmensen et al. (2011), Shao et al. (2011),
Cai and Liu (2011), Fan et al. (2012), and many others. Asymptotic optimality has been
established in some of these papers (Shao et al., 2011; Cai and Liu, 2011; Fan et al., 2012).
However, these asymptotic optimality results are limited to the case where there are only
two classes. The major difficulty preventing the derivation of asymptotic optimality for
multiclass classification is the following. For two-class classification, the classification
boundary of LDA is a hyperplane and there exist explicit formulas for the classification
error. When the number of classes is greater than two, however, the classification boundary
is usually complicated and no explicit formula for the classification error exists.

In this paper, we consider the asymptotic optimality in high-dimensional settings for a
large family of linear classification rules with an arbitrary number of classes under the
multivariate normal distribution. The classification rules of the optimal LDA, the classic
LDA, and those in Shao et al. (2011) and Cai and Liu (2011) all belong to this family.
We first provide an upper bound on the difference between the conditional classification
error of any classification rule in this family and the optimal classification error for arbitrary
n, p and K (the number of classes). Through an example, we illustrate that there exist
situations where this bound is asymptotically optimal. Based on the upper bound, we develop
our main theorem, which provides conditions leading to the asymptotic optimality of a
general classification rule in this family as the dimensionality and the sample size both go
to infinity and the number of classes is arbitrary. These conditions are relatively easy to
verify for particular classification rules in this family. We establish the convergence rates for
the asymptotic optimality under these conditions. Then we apply this theorem to several
particular classification rules. We extend the sparse LDA methods in Shao et al. (2011) and
Cai and Liu (2011) from the two-class situation to the multiclass situation, and apply our
general theorem to the two extended methods to obtain the asymptotic optimality and the
corresponding convergence rates. Simulation studies are performed to evaluate the predictive
performance of the two extended methods in various settings.

The rest of this paper is organized as follows. In Section 2, after introducing the notation
and main assumptions, the classic and two sparse LDA methods are briefly described. In
Section 3, we introduce a family of linear classification rules and provide the main theorems.
A necessary condition for the asymptotic optimality of the usual LDA and the corresponding
convergence rate are given in Section 4 as p, n and K all go to infinity. In Sections 5 and 6,
the two sparse LDA methods in Shao et al. (2011) and Cai and Liu (2011) are extended to the
multiclass case, and the asymptotic optimality and the corresponding convergence rates are
provided. Simulation studies are performed in Section 7. A short discussion is provided in
Section 8. All the proofs can be found in the supplementary materials.


2 LDA and sparse LDA 

We first introduce some notation. For any vector $v = (v_1, \dots, v_p)^T$, let $\|v\|_2$, $\|v\|_1$ and
$\|v\|_\infty = \max_{1 \le j \le p} |v_j|$ be the $l_2$, $l_1$ and $l_\infty$ norms of $v$, respectively. For any $p \times p$ symmetric
nonnegative definite matrix $M$, we use $\lambda_{\max}(M)$ and $\lambda_{\min}(M)$ to denote its largest and
smallest eigenvalues, respectively. We define two norms for $M$,

$$\|M\| = \sup_{v \in \mathbb{R}^p,\ \|v\|_2 = 1} \|Mv\|_2, \quad \text{and} \quad \|M\|_\infty = \max_{1 \le k, l \le p} |M_{kl}|,$$

where $M_{kl}$ is the $(k, l)$th entry of $M$. The first one is the operator norm and the second one
is the max norm.
In this paper, we assume that there are K classes and the population in the ith class has
a multivariate normal distribution $N_p(\mu_i, \Sigma)$, where $\mu_i$ is the ith class mean, $1 \le i \le K$, and
$\Sigma$ is the common covariance matrix for all classes. We assume that the prior probabilities
for all the classes are the same and equal to 1/K. We will consider the situations where
both n and p go to infinity, and K is arbitrary.

We first present two key regularity conditions for our theory. 

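Condition 1. There is a constant $c_0$ (independent of p and K) such that

$$c_0^{-1} \le \lambda_{\min}(\Sigma) \le \lambda_{\max}(\Sigma) \le c_0.$$

Condition 2. There exists a constant $c_1 > 0$ which does not depend on p and K, such that

$$\min_{1 \le i \ne j \le K} (\mu_j - \mu_i)^T \Sigma^{-1} (\mu_j - \mu_i) \ge c_1.$$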


Condition 1 is the same as (2) in Shao et al. (2011). To understand the meaning of
the inequality in Condition 2, we consider the optimal linear discriminant rule. For any
$x, y \in \mathbb{R}^p$, let $d_{\Sigma^{-1}}(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}$, which is the well-known Mahalanobis
distance. The optimal linear classification rule is given by (see Section 6.8 in Anderson
(2003) or Theorem 13.2 in Härdle and Simar (2012)),

$T_{OPT}$: to allocate a new observation $x$ to the ith class if $d_{\Sigma^{-1}}(x, \mu_i)$ is the smallest
among the K distances: $d_{\Sigma^{-1}}(x, \mu_1), d_{\Sigma^{-1}}(x, \mu_2), \dots, d_{\Sigma^{-1}}(x, \mu_K)$. (2.1)

We use $R_{OPT}$ to denote the misclassification rate of $T_{OPT}$. It is well known that under the
assumptions on the population distributions in this paper, $T_{OPT}$ is the Bayes rule and $R_{OPT}$
is the smallest among the misclassification rates of all possible classification rules. Condition
2 implies that the squared Mahalanobis distance between any two class means is not less
than $c_1$. If this condition is not satisfied, some class means will approach each other as n,
$p \to \infty$ and these classes will be completely mixed together. In the case of two classes, we
have an explicit formula for $R_{OPT}$: $R_{OPT} = \Phi\big(-\sqrt{(\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2)}/2\big)$, where $\Phi$ is
the cumulative distribution function of the standard normal distribution (see Section 13.1
in Härdle and Simar (2012)). If Condition 2 is not satisfied, we have $R_{OPT} \to 1/2$, which is
the misclassification rate of a random guess. Condition 2 excludes these situations.
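The two-class formula can be evaluated directly. The sketch below assumes the standard form $R_{OPT} = \Phi(-\Delta/2)$, where $\Delta$ is the Mahalanobis distance between the two class means; the function names are hypothetical. It shows the error approaching 1/2 as the squared distance shrinks to zero, which is the situation Condition 2 rules out.

```python
import math


def std_normal_cdf(t):
    """Phi(t), written with erfc so that small tail values stay accurate."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))


def bayes_error_two_class(delta_sq):
    """Two-class optimal error Phi(-sqrt(delta_sq) / 2).

    delta_sq is the squared Mahalanobis distance between the class means.
    """
    return std_normal_cdf(-math.sqrt(delta_sq) / 2.0)


# Errors for shrinking separation: they increase toward 1/2.
errors = [bayes_error_two_class(d) for d in (16.0, 4.0, 1.0, 0.01, 0.0)]
```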
In practice, $\mu_i$, $1 \le i \le K$, and $\Sigma$ are all unknown. Let $\mathbf{X} = \{x_{ij} : 1 \le i \le K, 1 \le j \le n_i\}$
be a sample data set from the population, where $x_{ij}$ is the jth observation from the ith class
and $n_i$ is the number of the observations from the ith class. Throughout this paper, we use

$$\bar{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij}, \quad 1 \le i \le K, \qquad \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{K} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)^T,$$

to denote the sample class means and the sample within-class covariance matrix, which are
estimates of $\mu_i$ and $\Sigma$, respectively. The classic LDA rule is given by

$T_{LDA}$: to allocate a new observation $x$ to the ith class if $d_{\hat{\Sigma}^{-1}}(x, \bar{x}_i)$ is the smallest among
$d_{\hat{\Sigma}^{-1}}(x, \bar{x}_1), d_{\hat{\Sigma}^{-1}}(x, \bar{x}_2), \dots, d_{\hat{\Sigma}^{-1}}(x, \bar{x}_K)$.

Unlike $T_{OPT}$, the rule $T_{LDA}$ depends on the sample $\mathbf{X}$. It has been argued in Chapter 1 in
Devroye et al. (2013) that the conditional misclassification rate is a more natural measure of
the predictive performance of a classification rule built on $\mathbf{X}$ than the unconditional
misclassification rate. Let $T$ be any linear classification rule based on $\mathbf{X}$. The conditional
misclassification rate of $T$ given $\mathbf{X}$ is defined as

$$R_T(\mathbf{X}) = \frac{1}{K} \sum_{i=1}^{K} P\big( x_{new} \text{ belongs to the } i\text{th class but } T(x_{new}) \ne i \,\big|\, \mathbf{X} \big),$$

where $x_{new}$ is a new observation independent of $\mathbf{X}$. $R_T(\mathbf{X})$ is a function of $\mathbf{X}$ and the
unconditional misclassification rate is the expectation of $R_T(\mathbf{X})$ with respect to the distribution of
$\mathbf{X}$. For simplicity, we use $R_{LDA}(\mathbf{X})$ to denote the conditional misclassification rate of $T_{LDA}$.
In high-dimensional settings, the classic LDA performs poorly and can even fail
completely (Bickel and Levina, 2004; Shao et al., 2011). To revise the classic LDA in
high-dimensional settings, we note that $d_{\Sigma^{-1}}(x, \mu_i) < d_{\Sigma^{-1}}(x, \mu_j)$ is equivalent to

$$(\mu_j - \mu_i)^T \Sigma^{-1} \Big( x - \frac{\mu_j + \mu_i}{2} \Big) < 0.$$
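The equivalence can be checked by expanding the two squared distances; the quadratic term in $x$ cancels and only a linear function of $x$ remains. A short derivation, using only the symmetry of $\Sigma^{-1}$:

```latex
\begin{aligned}
d_{\Sigma^{-1}}^2(x,\mu_i) - d_{\Sigma^{-1}}^2(x,\mu_j)
&= (x-\mu_i)^T\Sigma^{-1}(x-\mu_i) - (x-\mu_j)^T\Sigma^{-1}(x-\mu_j) \\
&= 2(\mu_j-\mu_i)^T\Sigma^{-1}x - (\mu_j-\mu_i)^T\Sigma^{-1}(\mu_j+\mu_i) \\
&= 2(\mu_j-\mu_i)^T\Sigma^{-1}\Bigl(x - \tfrac{\mu_j+\mu_i}{2}\Bigr).
\end{aligned}
```

The left-hand side is negative exactly when $x$ is closer to $\mu_i$ than to $\mu_j$, which gives the stated linear inequality.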


Shao et al. (2011) imposed sparsity conditions on $\Sigma$ and $\delta = \mu_2 - \mu_1$, and proposed the
following sparse LDA classification rule for two classes:

$T_{SLDA}$: to allocate a new observation $x$ to the first class if $\tilde{\delta}^T \tilde{\Sigma}^{-1} \big( x - \frac{\bar{x}_1 + \bar{x}_2}{2} \big) < 0$,
and to the second class otherwise, (2.2)

where $\tilde{\Sigma}$ and $\tilde{\delta}$ are the thresholding estimates of $\Sigma$ and $\delta$, respectively. We use $R_{SLDA}(\mathbf{X})$
to denote the conditional misclassification rate of $T_{SLDA}$ given the sample $\mathbf{X}$.


Cai and Liu (2011) further observed that in the case of two classes, the optimal
classification rule $T_{OPT}$ depends on $\Sigma$ only through $\beta = \Sigma^{-1} \delta$. Hence, they only assumed that $\beta$
is sparse and proposed a sparse estimate $\hat{\beta}$ of $\beta$ based on a linear programming optimization
problem. Then they proposed the following linear programming discriminant (LPD) rule for
two classes,

$T_{LPD}$: to allocate a new observation $x$ to the first class if $\hat{\beta}^T \big( x - \frac{\bar{x}_1 + \bar{x}_2}{2} \big) < 0$,
and to the second class otherwise. (2.3)


We use $R_{LPD}(\mathbf{X})$ to denote the conditional misclassification rate of $T_{LPD}$ given the sample
$\mathbf{X}$. The following definition of asymptotic optimality has been used in Shao et al. (2011),
Cai and Liu (2011) and other papers.

Definition 1. Let $T$ be a linear classification rule with conditional misclassification rate
$R_T(\mathbf{X})$. Then $T$ is asymptotically optimal if

$$\frac{R_T(\mathbf{X})}{R_{OPT}} - 1 = o_p(1).$$ (2.4)

Since $0 \le R_{OPT} \le R_T(\mathbf{X}) \le 1$ for any $\mathbf{X}$, (2.4) implies that $0 \le R_T(\mathbf{X}) - R_{OPT} = o_p(1)$.
Hence we have $R_T(\mathbf{X}) \to R_{OPT}$ in probability and $E[R_T(\mathbf{X})] \to R_{OPT}$, which have been
used to define the consistency of a classification rule by Devroye et al. (2013) and others. If
$R_{OPT}$ is bounded away from 0, then $R_T(\mathbf{X}) - R_{OPT} = o_p(1)$ also implies (2.4). However, if
$R_{OPT} \to 0$, (2.4) is stronger than $R_T(\mathbf{X}) - R_{OPT} = o_p(1)$.
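To see why the ratio criterion (2.4) is stronger when the optimal error vanishes, consider the purely hypothetical pair of error rates below: an optimal error equal to a normal upper tail probability at $\Delta/2$, and a competing error equal to the tail probability at $(\Delta - 1)/2$. These numbers are an illustrative assumption and do not come from any rule in the paper. As $\Delta$ grows, the difference tends to zero while the relative excess grows.

```python
import math


def upper_tail(t):
    """P(N(0,1) > t), computed with erfc for accuracy far in the tail."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def compare(delta, shift=1.0):
    """Return (difference, ratio - 1) for the hypothetical pair of error rates.

    r_opt = P(Z > delta / 2), r_t = P(Z > (delta - shift) / 2).
    """
    r_opt = upper_tail(delta / 2.0)
    r_t = upper_tail((delta - shift) / 2.0)
    return r_t - r_opt, r_t / r_opt - 1.0


results = [compare(d) for d in (4.0, 8.0, 12.0)]
differences = [r[0] for r in results]
relative_excess = [r[1] for r in results]
```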


3 Upper bounds and convergence rates 

In this section, we consider a family of linear classification rules motivated by the following
observation. The optimal classification rule $T_{OPT}$ can be rewritten in the following way. Let

$$a_{ji} = \Sigma^{-1/2} (\mu_j - \mu_i), \quad b_{ji} = \frac{1}{2} (\mu_j + \mu_i),$$ (3.1)

where $1 \le i, j \le K$. Then $T_{OPT}$ assigns a new observation $x$ to the ith class if
$a_{ji}^T \Sigma^{-1/2} (x - b_{ji}) < 0$ for all $j \ne i$. Based on this observation, we consider the family of
classification rules having the form,

$T$: to assign a new $x$ to the ith class if $\hat{a}_{ji}^T \Sigma^{-1/2} (x - \hat{b}_{ji}) < 0$, for all $j \ne i$, (3.2)

where $\hat{a}_{ji}$ and $\hat{b}_{ji}$ are p-dimensional vectors which may depend on the sample $\mathbf{X}$, and satisfy

$$\hat{a}_{ji} = -\hat{a}_{ij}, \quad \hat{b}_{ji} = \hat{b}_{ij},$$ (3.3)

for all $1 \le i \ne j \le K$. In addition to $T_{OPT}$, the classification rules $T_{LDA}$, $T_{SLDA}$ and $T_{LPD}$
all belong to this family. The specific expressions of $\hat{a}_{ji}$ and $\hat{b}_{ji}$ for them will be given in
the following sections. We first provide an upper bound on $R_T(\mathbf{X}) - R_{OPT}$ in the following
theorem.

Theorem 3.1.

$$R_T(\mathbf{X}) - R_{OPT} \le \frac{1}{K} \sum_{i=1}^{K} \sum_{j \ne i} P\Big( \hat{a}_{ji}^T Z > \hat{a}_{ji}^T \Sigma^{-1/2} (\hat{b}_{ji} - \mu_i),\ a_{ji}^T Z < a_{ji}^T \Sigma^{-1/2} (b_{ji} - \mu_i) \,\Big|\, \mathbf{X} \Big),$$ (3.4)

where $Z$ is a p-dimensional random vector with distribution $N(0, I_p)$ and independent of the
sample $\mathbf{X}$. For the special case of K = 2, we have the following equality,


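$$R_T(\mathbf{X}) - R_{OPT} = \frac{1}{2} \sum_{i=1}^{2} \sum_{j \ne i} \Big[ P\Big( \hat{a}_{ji}^T Z > \hat{a}_{ji}^T \Sigma^{-1/2} (\hat{b}_{ji} - \mu_i),\ a_{ji}^T Z < a_{ji}^T \Sigma^{-1/2} (b_{ji} - \mu_i) \,\Big|\, \mathbf{X} \Big) - P\Big( \hat{a}_{ji}^T Z < \hat{a}_{ji}^T \Sigma^{-1/2} (\hat{b}_{ji} - \mu_i),\ a_{ji}^T Z > a_{ji}^T \Sigma^{-1/2} (b_{ji} - \mu_i) \,\Big|\, \mathbf{X} \Big) \Big].$$ (3.5)

Here $P(\cdot \mid \mathbf{X})$ denotes the conditional probability given $\mathbf{X}$; for example, the first probability
is that of the event $\{\hat{a}_{ji}^T Z > \hat{a}_{ji}^T \Sigma^{-1/2} (\hat{b}_{ji} - \mu_i)\} \cap \{a_{ji}^T Z < a_{ji}^T \Sigma^{-1/2} (b_{ji} - \mu_i)\}$
given $\mathbf{X}$. Since $Z$ is independent of $\mathbf{X}$, when we calculate the above
conditional probability, we just need to consider $\hat{a}_{ji}$ and $\hat{b}_{ji}$ as constants and calculate the
probability with respect to the distribution of $Z$. The same interpretation will be applied
throughout this paper.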
In the case of K = 2, we obtained an equality (3.5), but for K > 2, the upper bound on
the right hand side of (3.4) is usually greater than $R_T(\mathbf{X}) - R_{OPT}$. So a natural question
is whether the upper bound is asymptotically optimal, that is, whether the ratio between
$R_T(\mathbf{X}) - R_{OPT}$ and the bound converges to 1. However, it is hard to answer this question
for general situations. We provide an example where the ratio converges to 1. This example
is somewhat artificial, but serves to illustrate that there exist situations where the bound in
(3.4) is asymptotically optimal and gives some insights into the probabilities in the upper
bound. The example is shown in Figure 1, where K = 3 and p = 2 are fixed.

Figure 1: Illustration of an example where the upper bound in (3.4) is asymptotically
optimal. Here K = 3 and $\mu_1$, $\mu_2$ and $\mu_3$ are the class means and the distance between any
two of them is the same and equal to $2 d_n$. The solid lines, $\gamma_1$, $\gamma_2$ and $\gamma_3$, denote the
boundary of $T_{OPT}$ and the dashed lines, $\hat{\gamma}_1$, $\hat{\gamma}_2$ and $\hat{\gamma}_3$, are the classification boundary of $T$.
$\hat{\gamma}_3$ is superposed on $\gamma_3$. $d_n$ and $d_n + \epsilon_n$ are distances from $\mu_1$ to the boundary lines of $T_{OPT}$
and $T$, respectively. $A$ is the region bounded by $\gamma_2$, $\hat{\gamma}_2$ and $\gamma_3$, and $B$ is the region bounded
by $\gamma_1$, $\hat{\gamma}_1$ and $\gamma_3$.

We assume that the pairwise distances between the class means are all the same, and that
$\Sigma$ is the $2 \times 2$ identity matrix. The boundary lines $\hat{\gamma}_1$ and $\hat{\gamma}_2$ are parallel to
$\gamma_1$ and $\gamma_2$, respectively, and the distances between these pairs of boundary lines are the
same and equal to $\epsilon_n$. The region $A$ is the collection of
points which are assigned to class 2 under $T_{OPT}$ but assigned to class 1 under $T$, and $B$ is
the collection of points which are assigned to class 3 under $T_{OPT}$ but assigned to class 1
under $T$. Hence, we can see that

$$R_T(\mathbf{X}) - R_{OPT} = \frac{1}{3} [P_2(A) + P_3(B) - P_1(A) - P_1(B)] = \frac{2}{3} [P_3(B) - P_1(B)],$$ (3.6)

where $P_i(\cdot)$ denotes the conditional probability $P_{x|\mathbf{X}}(\cdot \mid x \in \text{the } i\text{th class})$, which is just the
distribution $N(\mu_i, \Sigma)$. The last equality in (3.6) is because of the equalities $P_2(A) = P_3(B)$
and $P_1(A) = P_1(B)$ by the symmetry of $\mu_1$, $\mu_2$ and $\mu_3$ and the identity covariance matrix.
By the definitions of $T_{OPT}$ and $T$, $\gamma_1$, $\gamma_2$ and $\gamma_3$ are the lines satisfying the equations
$a_{13}^T \Sigma^{-1/2} (x - b_{13}) = 0$, $a_{21}^T \Sigma^{-1/2} (x - b_{21}) = 0$, and $a_{23}^T \Sigma^{-1/2} (x - b_{23}) = 0$, respectively, and
$\hat{\gamma}_1$, $\hat{\gamma}_2$ and $\hat{\gamma}_3$ are the lines satisfying the equations $\hat{a}_{13}^T \Sigma^{-1/2} (x - \hat{b}_{13}) = 0$, $\hat{a}_{21}^T \Sigma^{-1/2} (x - \hat{b}_{21}) = 0$,
and $\hat{a}_{23}^T \Sigma^{-1/2} (x - \hat{b}_{23}) = 0$, respectively. The following relationship can be found in the proof
of Theorem 3.1,


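$$P\Big( \hat{a}_{ji}^T Z > \hat{a}_{ji}^T \Sigma^{-1/2} (\hat{b}_{ji} - \mu_i),\ a_{ji}^T Z < a_{ji}^T \Sigma^{-1/2} (b_{ji} - \mu_i) \,\Big|\, \mathbf{X} \Big)
= P_i\Big( \hat{a}_{ji}^T \Sigma^{-1/2} (x - \hat{b}_{ji}) > 0,\ a_{ji}^T \Sigma^{-1/2} (x - b_{ji}) < 0 \Big).$$ (3.7)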


Therefore, for i = 3,

$$\sum_{j \ne 3} P\Big( \hat{a}_{j3}^T Z > \hat{a}_{j3}^T \Sigma^{-1/2} (\hat{b}_{j3} - \mu_3),\ a_{j3}^T Z < a_{j3}^T \Sigma^{-1/2} (b_{j3} - \mu_3) \,\Big|\, \mathbf{X} \Big)$$
$$= P_3\Big( \hat{a}_{13}^T \Sigma^{-1/2} (x - \hat{b}_{13}) > 0,\ a_{13}^T \Sigma^{-1/2} (x - b_{13}) < 0 \Big)
+ P_3\Big( \hat{a}_{23}^T \Sigma^{-1/2} (x - \hat{b}_{23}) > 0,\ a_{23}^T \Sigma^{-1/2} (x - b_{23}) < 0 \Big),$$ (3.8)

where $\{\hat{a}_{13}^T \Sigma^{-1/2} (x - \hat{b}_{13}) > 0\}$ denotes the half space above the whole line of which $\hat{\gamma}_1$ is
the left half part, and $\{a_{13}^T \Sigma^{-1/2} (x - b_{13}) < 0\}$ denotes the half space below the whole line
of which $\gamma_1$ is the left half part. The intersection of these two half spaces is the strip region
between the two whole lines and is denoted by $\tilde{B}$. The event in the second probability on the
right hand side of (3.8) is empty and the sum of the two probabilities is $P_3(\tilde{B})$. Similarly,
we can calculate the other two sums for i = 1, 2 and then it can be shown that the right
hand side of (3.4) is equal to $\frac{2}{3} P_3(\tilde{B})$. We will show that $P_3(\tilde{B}) / [P_3(B) - P_1(B)] \to 1$. Then
in this situation, the upper bound is asymptotically optimal. Let $d_n \to \infty$ and $d_n \epsilon_n \to \infty$.
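Without loss of generality, we assume that $\gamma_1$ is on the x-axis and $\mu_3$ is on the y-axis. Then
$P_3$ is the two-dimensional normal distribution with mean $(0, -d_n)$ and the covariance matrix
$\Sigma$ equal to the identity matrix. By the relationship

$$\{x < d_n/\sqrt{3},\ -\epsilon_n < y < 0\} \subset B \subset \tilde{B} = \{-\epsilon_n < y < 0\},$$

we have

$$P_3(x < d_n/\sqrt{3},\ -\epsilon_n < y < 0) \le P_3(B) \le P_3(\tilde{B}) = P_3(-\epsilon_n < y < 0),$$

and hence

$$1 \le \frac{P_3(\tilde{B})}{P_3(B)} \le \frac{P_3(-\epsilon_n < y < 0)}{P_3(x < d_n/\sqrt{3},\ -\epsilon_n < y < 0)} = \frac{1}{P_3(x < d_n/\sqrt{3})} \to 1.$$

Similarly, $P_1$ is the two-dimensional normal distribution with mean $(0, d_n)$ and the identity
covariance matrix. Hence, with $\phi$ denoting the standard normal density,

$$\frac{P_1(B)}{P_3(\tilde{B})} \le \frac{P_1(\tilde{B})}{P_3(\tilde{B})} = \frac{P_1(-\epsilon_n < y < 0)}{P_3(-\epsilon_n < y < 0)} = \frac{\int_{d_n}^{d_n + \epsilon_n} \phi(x)\,dx}{\int_{d_n - \epsilon_n}^{d_n} \phi(x)\,dx} = \frac{\int_{d_n - \epsilon_n}^{d_n} \phi(x + \epsilon_n)\,dx}{\int_{d_n - \epsilon_n}^{d_n} \phi(x)\,dx} \le e^{-d_n \epsilon_n + \epsilon_n^2/2} \to 0.$$

So we have $P_3(\tilde{B}) / [P_3(B) - P_1(B)] \to 1$. In this example, for simplicity, we make the
boundaries of $T$ and $T_{OPT}$ parallel, which is not necessary for the bound to be asymptotically
optimal.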
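The key step in the example is that the strip between the two boundary lines carries far less mass under class 1 than under class 3. The sketch below computes both strip probabilities in closed form, assuming the coordinate convention used above (class 3 centred at $(0, -d)$, class 1 at $(0, d)$, strip $\{-\epsilon < y < 0\}$); the function names and the chosen values of $d$ and $\epsilon$ are hypothetical.

```python
import math


def upper_tail(t):
    """P(N(0,1) > t), computed with erfc for accuracy far in the tail."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def strip_probs(d, eps):
    """Mass of the strip {-eps < y < 0} under y ~ N(-d, 1) and under y ~ N(d, 1).

    Returns (p3, p1): p3 = P(d - eps < Z < d), p1 = P(d < Z < d + eps).
    """
    p3 = upper_tail(d - eps) - upper_tail(d)
    p1 = upper_tail(d) - upper_tail(d + eps)
    return p3, p1


# d * eps takes the values 1, 4, 16; the ratio p1 / p3 shrinks quickly.
ratios = []
for d, eps in ((2.0, 0.5), (4.0, 1.0), (8.0, 2.0)):
    p3, p1 = strip_probs(d, eps)
    ratios.append(p1 / p3)
```

Only the y-coordinate matters here because the strip is unbounded in x and the covariance is the identity.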

We observe that the asymptotic convergence rate based on the upper bound for $R_T / R_{OPT} - 1$
obtained when K > 2 (even if K is fixed) is slower than that when K = 2. This phenomenon
can be explained by comparing (3.4) and (3.5). In the case of K = 2, there is a negative term
on the right hand side of (3.5), which has the same order as the positive term, and the
difference between the two terms has a higher order convergence rate for the classification rules
we consider in this paper. But there are no negative terms in (3.4). Hence, the asymptotic
convergence rate for K > 2 (even if K is fixed) is slower than that for K = 2. We use Figure
2 to illustrate the difference between these two cases. The solid lines are the classification
boundaries of $T_{OPT}$ and the dashed lines are the classification boundaries of $T$. In the left
figure, K = 2 and the plot is the projection of the p > 2 dimensional feature space onto the
two-dimensional space spanned by the vectors orthogonal to the two boundary hyperplanes,
respectively.

Figure 2: The solid lines are the classification boundaries of $T_{OPT}$ and the dashed lines are
the classification boundaries of $T$. In the left figure, K = 2 and $\mu_1$ and $\mu_2$ are the class
means. In the right figure, K = 3.

$A$ and $B$ denote the upper and lower regions between the two boundaries. We
use $P_1$ and $P_2$ to denote the probability distributions $N(\mu_1, I)$ and $N(\mu_2, I)$, respectively.
Using the same arguments as in the above example, we can obtain

$$R_T(\mathbf{X}) - R_{OPT} = \frac{1}{2} [P_1(B) - P_1(A)] + \frac{1}{2} [P_2(A) - P_2(B)].$$ (3.9)

For the classification rules considered in this paper, $A$ and $B$ actually form a "matched"
pair in the sense that $P_1(B) - P_1(A)$ is much smaller than $P_1(B)$ and $P_1(A)$, and so is
$P_2(A) - P_2(B)$. However, when K > 2, the boundaries are usually complicated. The right
figure is the case of K = 3 where we cannot find matched pairs as in the case of K = 2.
Now we consider the asymptotic optimality of a classification rule $T$ in the family.
$\hat{a}_{ji}$ and $\hat{b}_{ji}$ in the definition (3.3) of $T$ are typically estimates of $a_{ji}$ and $b_{ji}$ in (3.1) for $T_{OPT}$.
Given a classification rule $T$ with specific forms of $\hat{a}_{ji}$ and $\hat{b}_{ji}$, it is relatively easy to calculate
the convergence rates of $\hat{a}_{ji}$ and $\hat{b}_{ji}$ as shown in the following sections. We will establish
the asymptotic optimality of $T$ and the convergence rate for $R_T / R_{OPT} - 1$ based on the
convergence rates of $\hat{a}_{ji}$ and $\hat{b}_{ji}$. Let $M_{\min} = \min_{1 \le i \ne j \le K} (\mu_j - \mu_i)^T \Sigma^{-1} (\mu_j - \mu_i)$ and
$M_{\max} = \max_{1 \le i \ne j \le K} (\mu_j - \mu_i)^T \Sigma^{-1} (\mu_j - \mu_i)$ be the minimum and maximum squared Mahalanobis
distances between any two of the K class means, respectively. By (3.1), $\|a_{ji}\|_2^2 =
(\mu_j - \mu_i)^T \Sigma^{-1} (\mu_j - \mu_i)$. Therefore, under Condition 2, we have

$$c_1 \le M_{\min} = \min_{i \ne j} \|a_{ji}\|_2^2 \le \max_{i \ne j} \|a_{ji}\|_2^2 = M_{\max}.$$ (3.10)

$M_{\min}$ and $M_{\max}$ depend on p and K, and can go to infinity as $p \to \infty$.

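Theorem 3.2. Suppose that Conditions 1 and 2 hold and $\{s_n : n \ge 1\}$ is a sequence of
nonrandom positive numbers with $M_{\max} s_n \to 0$ as $n \to \infty$. For any $1 \le j \ne i \le K$, let

$$\hat{a}_{ji} = \hat{t}_{ji} a_{ji} + (\hat{a}_{ji})_\perp$$

be an orthogonal decomposition of $\hat{a}_{ji}$, where $\hat{t}_{ji} a_{ji}$ is the orthogonal projection of $\hat{a}_{ji}$ along
the direction of $a_{ji}$, $\hat{t}_{ji}$ is a real number, and $(\hat{a}_{ji})_\perp$ is orthogonal to $\hat{t}_{ji} a_{ji}$. Let

$$\hat{d}_{ji} = \hat{a}_{ji}^T \Sigma^{-1/2} (\hat{b}_{ji} - \mu_i), \quad d_{ji} = a_{ji}^T \Sigma^{-1/2} (b_{ji} - \mu_i) = \frac{1}{2} \|a_{ji}\|_2^2.$$

If the following conditions are satisfied,

$$\|(\hat{a}_{ji})_\perp\|_2^2 = \|a_{ji}\|_2^2\, O_p(s_n), \quad \hat{t}_{ji} = 1 + O_p(s_n), \quad \hat{d}_{ji} = d_{ji} + \|a_{ji}\|_2^2\, O_p(s_n),$$ (3.11)

where the $O_p(s_n)$ terms are uniform for all $1 \le j \ne i \le K$, then we have

$$\frac{R_T(\mathbf{X})}{R_{OPT}} - 1 = O_p\Big( K^2 \sqrt{M_{\max} s_n \log[(M_{\max} s_n)^{-1}]} \Big).$$ (3.12)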
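The orthogonal decomposition used in the theorem is an ordinary projection of the estimated direction onto the true one. A minimal sketch, with hypothetical function and variable names, returns the scalar coefficient and the orthogonal remainder, from which the relative size of the remainder can be read off.

```python
def orthogonal_decomposition(a_hat, a):
    """Split a_hat = t * a + perp, with perp orthogonal to a.

    t is the projection coefficient <a_hat, a> / <a, a>.
    """
    dot_aa = sum(v * v for v in a)
    t = sum(u * v for u, v in zip(a_hat, a)) / dot_aa
    perp = [u - t * v for u, v in zip(a_hat, a)]
    return t, perp


# Hypothetical true direction and a perturbed estimate of it.
a = [2.0, 0.0, 0.0]
a_hat = [2.2, 0.1, -0.3]
t, perp = orthogonal_decomposition(a_hat, a)

# Relative squared size of the orthogonal part: ||perp||^2 / ||a||^2.
rel_perp_sq = sum(v * v for v in perp) / sum(v * v for v in a)
```

In this toy case the coefficient is close to 1 and the relative squared remainder is small, which is the regime the theorem's conditions describe.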


It is natural to ask whether the convergence rate in (3.12) can be improved. The following 


theorem answers the question for the case where K is bounded. 


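Theorem 3.3. Assume that all the conditions in Theorem 3.2 hold. Moreover, suppose that
there exist constants $c_4$ and $c_5$ independent of n, p and K, such that

$$M_{\max} / M_{\min} \le c_4, \quad \min_{1 \le i \ne j \le K} \frac{\|(\hat{a}_{ji})_\perp\|_2^2}{\|a_{ji}\|_2^2} \ge c_5 s_n.$$ (3.13)

Then with probability converging to 1, the upper bound in (3.4) satisfies

$$\frac{1}{K} \sum_{i=1}^{K} \sum_{j \ne i} P\Big( \hat{a}_{ji}^T Z > \hat{a}_{ji}^T \Sigma^{-1/2} (\hat{b}_{ji} - \mu_i),\ a_{ji}^T Z < a_{ji}^T \Sigma^{-1/2} (b_{ji} - \mu_i) \,\Big|\, \mathbf{X} \Big) \ge \frac{c_5}{4 \sqrt{c_4}} \sqrt{M_{\max} s_n}\, R_{OPT}.$$ (3.14)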
Remark 2. 1. The inequality (3.14) indicates that any convergence rate derived from the
upper bound in (3.4) of Theorem 3.1 is at most $\sqrt{M_{\max} s_n}$ and cannot be faster. If K
is fixed or bounded, the convergence rate in (3.12) differs from $\sqrt{M_{\max} s_n}$ only by
the multiplicative factor $\sqrt{\log[(M_{\max} s_n)^{-1}]}$. Hence, the convergence rate in (3.12) cannot
be improved in this case except for the logarithmic factor, as long as the convergence rate
is derived from the upper bound in Theorem 3.1.

2. The first inequality in (3.13) indicates that all the Mahalanobis distances between the K
class means have the same order. If this assumption is not true, then there are some
class means among which the distances will be much smaller than those from them to
other class means. $R_{OPT}$ will be dominated by the errors between these classes (here,
the error between the ith and jth classes means the probability that an observation from
the ith class is assigned to the jth class or vice versa).

3. The second inequality in (3.13) guarantees that the convergence rate of $\|(\hat{a}_{ji})_\perp\|_2^2 / \|a_{ji}\|_2^2$
is no faster than $s_n$. The conditions in (3.11) in Theorem 3.2 imply that the convergence
rate of $\|(\hat{a}_{ji})_\perp\|_2^2 / \|a_{ji}\|_2^2$ is $s_n$ or faster. Under these two conditions, the convergence
rate of $\|(\hat{a}_{ji})_\perp\|_2^2 / \|a_{ji}\|_2^2$ is exactly $s_n$ in Theorem 3.3.


In the following sections, we will apply Theorem 3.2 to $T_{LDA}$, the extended $T_{SLDA}$ and
$T_{LPD}$.


4 Classic LDA 


The classification rule $T_{LDA}$ of the classic LDA is a special case of the rule in (3.2) with

$$\hat{a}_{ji} = \Sigma^{1/2} \hat{\Sigma}^{-1} (\bar{x}_j - \bar{x}_i), \quad \hat{b}_{ji} = (\bar{x}_j + \bar{x}_i)/2,$$ (4.1)

for all $1 \le i \ne j \le K$. Let $\sigma_{kl}$ and $\hat{\sigma}_{kl}$ denote the $(k, l)$th elements of $\Sigma$ and $\hat{\Sigma}$, respectively,
where $1 \le k, l \le p$. When there is one population, that is, K = 1, it has been shown in (12)
of Bickel and Levina (2008) that $\max_{k,l} |\hat{\sigma}_{kl} - \sigma_{kl}| = O_p(\sqrt{\log p / n})$. Because we allow the
number of classes K to go to infinity, in the following theorem we consider the effect of
a large K.
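Theorem 4.1. Suppose that Condition 1 holds. Then we have

$$\max_{k,l} \Big| \hat{\sigma}_{kl} - \Big(1 - \frac{K}{n}\Big) \sigma_{kl} \Big| = O_p\Big( \sqrt{\frac{\log p}{n}} \Big).$$

Moreover,

$$\Big\| \hat{\Sigma} - \Big(1 - \frac{K}{n}\Big) \Sigma \Big\| = O_p\Big( p \sqrt{\frac{\log p}{n}} \Big), \quad \text{and hence} \quad \|\hat{\Sigma} - \Sigma\| = O_p\Big( p \sqrt{\frac{\log p}{n}} + \frac{K}{n} \Big).$$ (4.2)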


From Theorem 4.1, one can see that a large K has a shrinkage effect on $\hat{\Sigma}$, that is, when
K is large, the entry $\hat{\sigma}_{kl}$ of $\hat{\Sigma}$ is close to the shrunk entry $(1 - K/n) \sigma_{kl}$ of $\Sigma$. The convergence rates in
Theorem 4.1 play a basic role in the following theoretical development.
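The shrinkage effect can be seen already in one dimension: a within-class variance estimate that divides by the total sample size n, after centring each of the K classes at its own mean, has expectation $(1 - K/n)$ times the true variance. The Monte Carlo sketch below is an illustration under assumed settings (K = 20 classes, 5 observations each, unit variance, fixed seed) and uses hypothetical function names.

```python
import random


def pooled_variance(samples_by_class):
    """Within-class variance with divisor n, the total sample size."""
    n = sum(len(s) for s in samples_by_class)
    total = 0.0
    for s in samples_by_class:
        mean = sum(s) / len(s)
        total += sum((v - mean) ** 2 for v in s)
    return total / n


def average_estimate(K, per_class, sigma2=1.0, reps=2000, seed=0):
    """Average the pooled variance over repeated simulated samples."""
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    acc = 0.0
    for _ in range(reps):
        # Class i has mean i; the class means do not affect the estimator.
        data = [[rng.gauss(float(i), sd) for _ in range(per_class)]
                for i in range(K)]
        acc += pooled_variance(data)
    return acc / reps


K, per_class = 20, 5
n = K * per_class
avg = average_estimate(K, per_class)
shrunk_target = 1.0 - K / n  # (1 - K/n) * sigma2 with sigma2 = 1
```

With these settings the average lands near 0.8 rather than 1, which is why rescaling by $(1 - K/n)^{-1}$ is natural when K is not negligible relative to n.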


Shao et al. (2011) provided necessary conditions for the classic LDA to be asymptotically
optimal in the case K = 2 as both $n, p \to \infty$. We will extend the results to the case of
K > 2. Before we state the theorem, we will exclude the situations where there are very
small numbers of observations in some classes by assuming the following condition:


Condition 3. There exists a constant $c_2$ independent of n, p, and K such that

$$\frac{1}{\min_{1 \le i \le K} n_i} \le c_2 \frac{K}{n}.$$ (4.3)

This condition implies that $n_j \ge \min_{1 \le i \le K} n_i \ge c_2^{-1} n / K$ for any $1 \le j \le K$. Therefore,
the number of observations in any class is of the same order as the average number n/K of
observations.


Theorem 4.2. Suppose that Conditions 1-3 hold and $K \le p + 1$. Let

$$s_n = p \sqrt{\frac{\log p}{n}} + \frac{K}{\sqrt{n M_{\min}}} \to 0.$$ (4.4)

Then if $K^2 \sqrt{M_{\max} s_n \log[(M_{\max} s_n)^{-1}]} \to 0$, the classic LDA is asymptotically optimal and

$$\frac{R_{LDA}(\mathbf{X})}{R_{OPT}} - 1 = O_p\Big( K^2 \sqrt{M_{\max} s_n \log[(M_{\max} s_n)^{-1}]} \Big).$$ (4.5)


Remark 3. 1. We compare Theorem 4.2 with Theorem 1 in Shao et al. (2011), where it is
assumed that only the first term $p \sqrt{\log p / n}$ of $s_n$ in (4.4) converges to zero. The second
term of $s_n$ in (4.4) is due to the effect of K. When K has the order $p \sqrt{M_{\min} \log p}$ or
larger, the second term in $s_n$ is not negligible.

2. If K is fixed or bounded and $p \to \infty$, then the second term of $s_n$ satisfies

$$\frac{K}{\sqrt{n M_{\min}}} = o\Big( p \sqrt{\frac{\log p}{n}} \Big).$$

In this case, the condition $K^2 \sqrt{M_{\max} s_n \log[(M_{\max} s_n)^{-1}]} \to 0$ in Theorem 4.2 is
equivalent to $M_{\max} s_n \to 0$. When K = 2, this is the same as that in Theorem 1 in Shao
et al. (2011). To see this, note that when K = 2,

$$M_{\min} = (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2) = M_{\max},$$

and hence the $\Delta^2$ defined as $(\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2)$ in Shao et al. (2011) is equal
to $M_{\max}$. Therefore, $M_{\max} s_n \to 0$ is equivalent to $\Delta^2 s_n \to 0$, which is the condition in
Theorem 1 in Shao et al. (2011). Moreover, as discussed in Section 3, when K > 2,
the convergence rate in (4.5) is slower than that for K = 2 in Theorem 1 in Shao
et al. (2011).


5 Sparse LDA by thresholding 

It has been shown in Theorem 2 of Shao et al. (2011) that when $p/n \to \infty$, the classic LDA
may not be asymptotically optimal. By imposing sparsity conditions on $\Sigma$ and $\mu_1 - \mu_2$, Shao
et al. (2011) proposed a sparse LDA rule by thresholding and proved that it is asymptotically
optimal in the case of K = 2. In this section, we extend this method to arbitrary K and
provide asymptotic optimality. As in Shao et al. (2011), we consider the following sparsity
measure on $\Sigma$ in Bickel and Levina (2008),

$$C_{h,p} = \max_{1 \le k \le p} \sum_{l=1}^{p} |\sigma_{kl}|^h,$$ (5.1)

where $0 \le h < 1$ is a constant independent of p. Shao et al. (2011) used the sparse estimate
of $\Sigma$ in Bickel and Levina (2008), obtained by performing a thresholding procedure on $\hat{\Sigma}$.
In this case, we need to consider the effect of large K. Hence, we propose to apply the
thresholding procedure to $(1 - K/n)^{-1} \hat{\Sigma}$, instead of to $\hat{\Sigma}$. Specifically, let $\tilde{\Sigma}$ be the
thresholding estimate with the $(k, l)$th entry

$$\tilde{\sigma}_{kl} = (1 - K/n)^{-1} \hat{\sigma}_{kl}\, I\big[ (1 - K/n)^{-1} |\hat{\sigma}_{kl}| > t_n \big], \quad \forall\, 1 \le k, l \le p,$$

where $t_n = M_1 \sqrt{\log p / n}$, $M_1$ is a large enough positive constant and $\hat{\sigma}_{kl}$ is the $(k, l)$th entry of
$\hat{\Sigma}$. We first derive the convergence rate for the revised thresholding estimator.
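The revised estimator rescales each entry of the sample covariance by $(1 - K/n)^{-1}$ and then sets to zero the entries whose rescaled magnitude does not exceed the threshold. The sketch below assumes a threshold of the form $M_1 \sqrt{\log p / n}$, as in Bickel and Levina (2008); the function name, the default constant and the toy matrix are hypothetical.

```python
import math


def threshold_covariance(sigma_hat, K, n, M1=1.0):
    """Rescale sigma_hat by 1 / (1 - K/n), then hard-threshold entrywise.

    The threshold is t_n = M1 * sqrt(log(p) / n) (assumed form), where p is
    the dimension. Entries with rescaled magnitude <= t_n are set to 0.
    """
    p = len(sigma_hat)
    t_n = M1 * math.sqrt(math.log(p) / n)
    scale = 1.0 / (1.0 - K / n)
    return [[scale * v if abs(scale * v) > t_n else 0.0 for v in row]
            for row in sigma_hat]


# Toy 3 x 3 sample covariance with one small off-diagonal entry.
sigma_hat = [[0.90, 0.05, 0.00],
             [0.05, 0.90, 0.30],
             [0.00, 0.30, 0.90]]
sigma_tilde = threshold_covariance(sigma_hat, K=10, n=100)
```

In practice the constant $M_1$ would need to be chosen, for example by cross-validation; the default of 1 here is only a placeholder.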

Theorem 5.1. Suppose that Condition 1 holds, $\log p / n = o(1)$ and

$$d_n = C_{h,p} \Big( \frac{\log p}{n} \Big)^{(1-h)/2} = o(1).$$

Then

$$\|\tilde{\Sigma} - \Sigma\| = O_p(d_n), \quad \|\tilde{\Sigma}^{-1} - \Sigma^{-1}\| = O_p(d_n).$$ (5.2)

$d_n$ is the convergence rate of the thresholding estimate of the covariance matrix for one
population in Bickel and Levina (2008). Hence, the revised thresholding estimate has the
same convergence rate for any K. Now define

$$\delta_{ji} = \mu_j - \mu_i, \quad \hat{\delta}_{ji} = \bar{x}_j - \bar{x}_i,$$ (5.3)
for all $1 \le i, j \le K$. We define the following sparsity measure on $\delta_{ji}$, which is an extension
of (9) in Shao et al. (2011),

$$D_{g,p} = \max_{1 \le i, j \le K} \sum_{k=1}^{p} \big( \delta_{ji}^{(k)} \big)^{2g},$$ (5.4)

where $\delta_{ji}^{(k)}$ is the kth coordinate of $\delta_{ji}$ and $0 \le g < 1$ is a constant independent of p and K.
When K = 2, there is essentially one $\delta_{ji}$ because $\delta_{21} = -\delta_{12}$. In this case, $D_{g,p}$ is just
that in Shao et al. (2011). We extend the classification rule of the sparse LDA method in
Shao et al. (2011) to arbitrary K as follows.


T-1 


Tslda : to allocate a new observation x to the i class if 8 (x — b ? , ; ) < 0 , Vj 7 ^ i, 


where Sji is a sparse estimator of 8 Jt and bis an estimate of (/ij + Hi)/ 2. One may naturally 
take Sji to be 8 j t thresholded and h, Jt = (x,- + x^)/2 as in 


Shao et al. 


(2011). However, this 


choice of Sji and b ?i is problematic because in this case, there may exist multiple i’s which 


rT 1 


satisfy <5^(x — b J2 ;) < 0 , for all j ^ i , and hence we cannot uniquely determine the class 
to which we assign x. Therefore, we propose the following estimates. Define the following 
threshold for Sji, 

dogjA ° 


— M 2 


n J 


(5.5) 
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with M 2 > 0 and a G (0,1/2) are constants. Define the thresholding estimates 

<5n = 0, Sji = 1 < J < K, 


and let 


dy i 3ji dji, b ji Xi T 


Sjl + dj! 


, VI <i,j < K, 


where for simplicity, we first estimate Sji, 1 < j < K, but one can first estimate S j2 , 
1 < j < K, without any effects on the asymptotic optimality results. Under the above 

~T- 1 ~ 

definitions, d*£ (x - bji) < 0 if and only if 


1 


x-(d i:L + xi) S x-(dii + xi)l < (x-(dji + x 1 )) S x - (^i + x x 


' —l 


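Hence, $T_{SLDA}$ assigns $x$ to the $i$th class if and only if $\left(x - (\tilde{\delta}_{i1} + \bar{x}_1)\right)^{T}\tilde{\Sigma}^{-1}\left(x - (\tilde{\delta}_{i1} + \bar{x}_1)\right)$
is the smallest among $\left(x - (\tilde{\delta}_{j1} + \bar{x}_1)\right)^{T}\tilde{\Sigma}^{-1}\left(x - (\tilde{\delta}_{j1} + \bar{x}_1)\right)$, $1 \le j \le K$. It is easy to see
that $T_{SLDA}$ is a special case of (3.2) with

$$\hat{a}_{ji} = \Sigma^{1/2}\tilde{\Sigma}^{-1}\tilde{\delta}_{ji}, \quad \hat{b}_{ji} = \tilde{b}_{ji} = \bar{x}_1 + \frac{\tilde{\delta}_{j1} + \tilde{\delta}_{i1}}{2}, \quad \forall\, 1 \le i, j \le K.$$

Moreover, let $r > 1$ be a fixed constant and define

$$q_n = \max_{1 < j \le K}\left\{\text{the number of } k\text{'s with } |\delta_{j1}^{k}| > a_n/r\right\},$$

where $\delta_{j1}^{k}$ denotes the $k$th coordinate of $\delta_{j1}$.

Then by Lemma 2 (ii) in Shao et al. (2011), with probability converging to 1, the number
of nonzero coordinates of $\tilde{\delta}_{j1}$ is less than or equal to $q_n$ and hence the number of nonzero
coordinates of $\tilde{\delta}_{ji}$ is less than or equal to $2q_n$.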
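The extended rule can be illustrated with a minimal Python sketch. It rests on assumptions that are not fixed by the text here: $\hat{\delta}_{j1}$ is taken to be $\bar{x}_j - \bar{x}_1$, the matrix passed as `sigma_inv` is an already computed inverse (or generalized inverse) of the thresholded covariance matrix, and the function names `threshold`, `quad_form` and `slda_assign` are hypothetical. The point of the sketch is that assigning $x$ to the class with the smallest quadratic form always yields a class.

```python
def threshold(vec, a_n):
    """Hard-threshold coordinatewise: keep v when |v| > a_n, otherwise set it to 0."""
    return [v if abs(v) > a_n else 0.0 for v in vec]


def quad_form(u, sigma_inv):
    """Return u^T sigma_inv u for a list vector and a list-of-lists matrix."""
    p = len(u)
    return sum(u[r] * sigma_inv[r][c] * u[c] for r in range(p) for c in range(p))


def slda_assign(x, class_means, sigma_inv, a_n):
    """Return the 0-based class whose centre xbar_1 + delta~_{i1} is closest to x.

    class_means[0] plays the role of xbar_1, and delta^_{i1} is assumed to be
    xbar_i - xbar_1 (an assumption of this sketch).
    """
    xbar1 = class_means[0]
    dists = []
    for xbar_i in class_means:
        delta_hat = [m - m1 for m, m1 in zip(xbar_i, xbar1)]
        delta_tilde = threshold(delta_hat, a_n)  # the first class gives the zero vector
        centre = [m1 + d for m1, d in zip(xbar1, delta_tilde)]
        resid = [xv - cv for xv, cv in zip(x, centre)]
        dists.append(quad_form(resid, sigma_inv))
    # min() returns the first index in case of ties, so a class is always returned.
    return min(range(len(dists)), key=dists.__getitem__)


# Toy example with three classes in three dimensions and an identity matrix.
means = [[0.0, 0.0, 0.0], [1.0, 0.05, 0.0], [0.0, 1.0, -0.02]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
label = slda_assign([0.9, 0.1, 0.0], means, identity, a_n=0.1)
```

In the toy example the small coordinates 0.05 and -0.02 are thresholded to zero before the class centres are formed, and `label` is the index of the second class.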


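Theorem 5.2. Suppose that the Conditions hold and

$$b_n = \max\left\{ d_n,\ \frac{a_n^{1-g}\sqrt{D_{g,p}}}{\sqrt{M_{\min}}},\ \frac{\sqrt{(C_{h,p} + K)\,q_n}}{\sqrt{n M_{\min}}} \right\} \to 0, \tag{5.6}$$

where $d_n$ is defined in Theorem 5.1. Then $T_{SLDA}$ is asymptotically optimal and

$$\frac{R_{SLDA}(X)}{R_{OPT}} - 1 = O_p\left(K^2\sqrt{M_{\max} b_n}\,\log\left[(M_{\max} b_n)^{-1}\right]\right). \tag{5.7}$$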


The only difference between $b_n$ in (5.6) and that in Theorem 3 in Shao et al. (2011) is
that we use $C_{h,p} + K$ in (5.6) to replace $C_{h,p}$ in Shao et al. (2011).


6 Linear programming discriminant rule


Cai and Liu (2011) observed that the optimal classification rule $T_{OPT}$ depends on $\Sigma$ only
through the vectors $\beta_{ji} = \Sigma^{-1}\delta_{ji}$, where $\delta_{ji}$ is defined in (5.3), $1 \le i, j \le K$. They proposed
the linear programming discriminant (LPD) rule, where the key step is to estimate $\beta$ through a
constrained $\ell_1$ minimization. We extend the LPD rule to the case where $K$ is arbitrary. We
first define the estimates $\hat{\beta}_{11} = 0$ and $\hat{\beta}_{j1}$ to be the solution to

$$\min_{\beta}\ \|\beta\|_1, \quad \text{subject to } \|\bar{\Sigma}\beta - \hat{\delta}_{j1}\|_{\infty} \le \lambda_n, \tag{6.1}$$

for any $1 < j \le K$, where $\bar{\Sigma} = (1 - K/n)^{-1}\hat{\Sigma}$ and $\lambda_n = C\sqrt{M_{\max}\log p/n}$. Then let

$$\hat{\beta}_{ji} = \hat{\beta}_{j1} - \hat{\beta}_{i1}$$

for any $1 \le i, j \le K$. We use $\bar{\Sigma}$ in the optimization problem (6.1) instead of $\hat{\Sigma}$ as in Cai
and Liu (2011) to remove the shrinkage effect of large $K$. By Theorem 4.1,

$$\|\bar{\Sigma} - \Sigma\|_{\infty} = O_p\left(\sqrt{\frac{\log p}{n}}\right). \tag{6.2}$$
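For computation, a constrained $\ell_1$ minimization of the form (6.1) can be rewritten as a linear program by a standard device. The display below is a sketch of that reformulation, not a statement of the authors' implementation; the auxiliary vector $u$, and the shorthand $S$ for the rescaled covariance estimate and $d$ for $\hat{\delta}_{j1}$, are notation introduced here for illustration.

```latex
\begin{align*}
\min_{\beta,\,u \in \mathbb{R}^{p}}\ & \sum_{k=1}^{p} u_k \\
\text{subject to}\ & -u_k \le \beta_k \le u_k, && k = 1,\dots,p, \\
& -\lambda_n \le (S\beta)_k - d_k \le \lambda_n, && k = 1,\dots,p.
\end{align*}
```

At an optimum $u_k = |\beta_k|$, so the objective equals $\|\beta\|_1$. The program has $2p$ variables and $4p$ inequality constraints, and it is solved once for each $1 < j \le K$.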


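The classification rule is

$T_{LPD}$: to allocate a new observation $x$ to the $i$th class if $\hat{\beta}_{ji}^{T}\left(x - \frac{\bar{x}_j + \bar{x}_i}{2}\right) < 0$, $\forall j \neq i$,

which is of the general form (3.2) with

$$\hat{a}_{ji} = \Sigma^{1/2}\hat{\beta}_{ji}, \quad \hat{b}_{ji} = \frac{\bar{x}_j + \bar{x}_i}{2}.$$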
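Once the estimates of $\beta_{j1}$ are available, applying the LPD rule only requires inner products. The Python sketch below illustrates this under stated assumptions: the vectors in `beta_to_first` are supplied by some solver of (6.1) that is not shown, the function name `lpd_assign` is hypothetical, and returning `None` when no class passes all $K - 1$ comparisons is a convention chosen for this sketch rather than something specified in the text.

```python
def lpd_assign(x, class_means, beta_to_first):
    """Return the 0-based class i with beta_ji^T (x - (xbar_j + xbar_i)/2) < 0 for all j != i.

    beta_to_first[j] holds the estimate of beta_{j1}; beta_to_first[0] is the zero vector.
    Returns None if no class satisfies all K - 1 inequalities (a convention of this sketch).
    """
    num_classes = len(class_means)
    for i in range(num_classes):
        passes = True
        for j in range(num_classes):
            if j == i:
                continue
            # beta_ji = beta_j1 - beta_i1
            beta_ji = [bj - bi for bj, bi in zip(beta_to_first[j], beta_to_first[i])]
            midpoint = [(mj + mi) / 2.0 for mj, mi in zip(class_means[j], class_means[i])]
            score = sum(b * (xv - m) for b, xv, m in zip(beta_ji, x, midpoint))
            if score >= 0:
                passes = False
                break
        if passes:
            return i
    return None


# Toy example: identity covariance, so beta_j1 equals xbar_j - xbar_1.
means = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
betas = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
label = lpd_assign([1.8, 0.1], means, betas)
```

In the toy example the point lies closest to the second class mean, and `label` is that class's index.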


Theorem 6.1. Suppose that the Conditions hold and

$$r_n = \left(\sqrt{K M_{\max}}\,\max_{1 < j \le K}\|\Sigma^{-1}\delta_{j1}\|_1 + \max_{1 < j \le K}\|\Sigma \cdots\right)\sqrt{\frac{\log p}{n}} \to 0. \tag{6.3}$$

Then $T_{LPD}$ is asymptotically optimal and

$$\frac{R_{LPD}(X)}{R_{OPT}} - 1 = O_p\left(\cdots\right). \tag{6.4}$$

If $K$ is fixed or bounded, the condition is essentially the same as that in Theorem 3 of
Cai and Liu (2011).




7 Simulation Study 


In this section, we conduct simulation studies to evaluate the classification performance of the 
extended LPD and SLDA, and compare them with the nearest shrunken centroids method 
(NSC), the classic LDA rule with a generalized inverse matrix (GLDA) and the optimal 
classification rule. There are several different definitions for the generalized inverse matrix. 
We use the Moore-Penrose pseudoinverse and calculate it using the MATLAB function "pinv".
Although we cannot calculate the optimal rule in practice, we include it in this simulation
study as a benchmark. All the methods are implemented in MATLAB except NSC, which
is implemented using the R package “pamr” with the default setting and cross-validation 
procedure. 

In LPD, we choose the tuning parameter $\lambda_n$ in (6.1) from the set $\{0.2, 0.25, 0.3, \ldots, 0.65, 0.7\}$
by the five-fold cross-validation procedure. There are two tuning parameters in SLDA:
$t_n = M_1\sqrt{\log p/n}$ and $a_n = M_2(\log p/n)^{\alpha}$, the thresholds for $\hat{\Sigma}$ and $\hat{\delta}_{ji}$, respectively. $M_1$ is
chosen from $\{10^{-5}, 10^{-4}, \ldots, 1\}$ and $M_2$ from $\{10^{-7}, 10^{-6}, \ldots, 1\}$. We choose $\alpha = 0.3$ as
in Shao et al. (2011). One practical issue of SLDA is that the thresholded covariance matrix
may not be invertible. We propose two different ways to overcome this problem and compare
their performance in this study. In the first one, we use the Moore-Penrose pseudoinverse
of $\tilde{\Sigma}$ to replace $\tilde{\Sigma}^{-1}$ in $T_{SLDA}$ and denote this approach by SLDA1. In the second one, we
replace $\tilde{\Sigma}^{-1}$ by $(\tilde{\Sigma} + \epsilon I_p)^{-1}$, where $\epsilon$ is a small positive number which will be viewed as a
tuning parameter and $I_p$ is the $p$-dimensional identity matrix. $(\tilde{\Sigma} + \epsilon I_p)^{-1}$ is the inverse of
$\tilde{\Sigma} + \epsilon I_p$ if it is of full rank; otherwise, $(\tilde{\Sigma} + \epsilon I_p)^{-1}$ is the generalized inverse. This approach
is denoted by SLDA2. We choose $\epsilon$ from the set $\{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$. There are
two tuning parameters for SLDA1 and three for SLDA2. For both SLDA1 and SLDA2, we
choose the tuning parameters by the five-fold cross-validation procedure.
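To make the size of the search concrete, the tuning grids described above can be enumerated as in the following Python sketch. The variable and function names are hypothetical, and evaluating the thresholds at the full training size $n = 450$ (rather than at the size of a cross-validation fold) is an assumption of this sketch, since the text does not specify it.

```python
import math
from itertools import product


def power_grid(min_exp, max_exp=0):
    """Powers of ten 10**min_exp, ..., 10**max_exp."""
    return [10.0 ** e for e in range(min_exp, max_exp + 1)]


lambda_grid = [round(0.2 + 0.05 * k, 2) for k in range(11)]  # 0.2, 0.25, ..., 0.7
m1_grid = power_grid(-5)        # candidates for M_1: 1e-5, ..., 1
m2_grid = power_grid(-7)        # candidates for M_2: 1e-7, ..., 1
eps_grid = power_grid(-5, -1)   # candidates for epsilon: 1e-5, ..., 1e-1
ALPHA = 0.3


def slda_thresholds(m1, m2, p, n, alpha=ALPHA):
    """Return (t_n, a_n) = (M_1 sqrt(log p / n), M_2 (log p / n)^alpha)."""
    ratio = math.log(p) / n
    return m1 * math.sqrt(ratio), m2 * ratio ** alpha


# Every combination visited by an exhaustive cross-validation search.
slda1_grid = list(product(m1_grid, m2_grid))
slda2_grid = list(product(m1_grid, m2_grid, eps_grid))

t_n, a_n = slda_thresholds(1.0, 1.0, p=300, n=450)
```

Under these grids LPD has 11 candidate values, SLDA1 has 48 candidate pairs and SLDA2 has 240 candidate triples, which is consistent with SLDA2 being the most expensive of the three to tune by exhaustive search.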

We consider three models with different class means and within-class covariance
matrices. For each model, we consider three different numbers of classes, $K = 3, 6, 9$, and two
different dimensionalities, $p = 300, 600$. The total number of observations from all the classes
is fixed at $n = 450$ for all models. The numbers of observations from all the classes are
the same, that is, $n_1 = n_2 = \cdots = n_K = 450/K$. Therefore, when $K = 3$, we have 150
observations from each class, and when $K = 9$, we have only 50 observations from each class.
For all three models, the class means have the form

$$\mu_k = (\underbrace{0,\ldots,0}_{(k-1)s_0},\ \underbrace{1,\ldots,1}_{s_0},\ \underbrace{0,\ldots,0}_{p-ks_0})^{T}, \quad 1 \le k \le K.$$

There are $s_0$ ones and all the other entries are equal to zero in the $p$-dimensional
vectors of the class means. The specific details of the three models are given
below:


1. $s_0 = 5$ and $\Sigma = (\sigma_{ij})_{p \times p}$ with $\sigma_{ii} = 1$ for $1 \le i \le p$ and $\sigma_{ij} = 0.5$ for $i \neq j$.

2. $s_0 = 3$ and $\Sigma$ is a block diagonal matrix given by

$$\Sigma = \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}, \tag{7.1}$$

where $\Sigma_{11}$ is a $100 \times 100$ matrix with diagonal elements equal to 1 and off-diagonal
elements equal to 0.7, and $\Sigma_{22}$ is a $(p - 100) \times (p - 100)$ matrix with diagonal elements
equal to 1 and off-diagonal elements equal to 0.5.

3. $s_0 = 10$ and $\Sigma = (\sigma_{ij})_{p \times p}$ with $\sigma_{ij} = 0.95^{|i-j|}$ for $i \neq j$.


Models 1 and 3 are similar to those in Cai and Liu (2011). In Model 1, all the entries of
$\Sigma^{-1}$ and $\Sigma^{-1}\delta_{ij}$ ($i \neq j$) are nonzero. In Model 2, $\Sigma^{-1}$ is also a block diagonal matrix with
two diagonal blocks equal to $\Sigma_{11}^{-1}$ and $\Sigma_{22}^{-1}$, respectively. All the entries of $\Sigma_{11}^{-1}$ and $\Sigma_{22}^{-1}$ are
nonzero, but all the coordinates of the vector $\Sigma^{-1}\delta_{ij}$ ($i \neq j$) are equal to zero except the first
100 entries. In Model 3, the entries of $\Sigma$ decay exponentially with $|i - j|$ and hence $\Sigma$ satisfies
the conditions for SLDA. The $(i, j)$th entry of $\Sigma^{-1}$ is zero when $|i - j| > 1$. Therefore, $\Sigma^{-1}$
is a sparse matrix.
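The three designs can be written down directly, and the sparsity claim for Model 3 can be checked numerically. The following Python sketch uses hypothetical function names and plain nested lists; the closed-form tridiagonal inverse of the matrix with entries $\rho^{|i-j|}$ is a standard identity that is brought in here for the check and is not stated in the text.

```python
def class_mean(k, s0, p):
    """mu_k for k = 1, ..., K: (k-1)*s0 zeros, then s0 ones, then zeros."""
    start = (k - 1) * s0
    return [1.0 if start <= idx < start + s0 else 0.0 for idx in range(p)]


def compound_symmetric(p, rho=0.5):
    """Model 1: unit diagonal and a common off-diagonal value."""
    return [[1.0 if r == c else rho for c in range(p)] for r in range(p)]


def block_diagonal(p, p1=100, rho1=0.7, rho2=0.5):
    """Model 2: a p1 x p1 block and a (p - p1) x (p - p1) block, zeros elsewhere."""
    sigma = [[0.0] * p for _ in range(p)]
    for r in range(p):
        for c in range(p):
            if r < p1 and c < p1:
                sigma[r][c] = 1.0 if r == c else rho1
            elif r >= p1 and c >= p1:
                sigma[r][c] = 1.0 if r == c else rho2
    return sigma


def ar1(p, rho=0.95):
    """Model 3: entries rho^{|i-j|}."""
    return [[rho ** abs(r - c) for c in range(p)] for r in range(p)]


def ar1_inverse(p, rho=0.95):
    """Closed-form tridiagonal inverse of ar1(p, rho), valid for p >= 2."""
    scale = 1.0 / (1.0 - rho ** 2)
    inv = [[0.0] * p for _ in range(p)]
    for r in range(p):
        inv[r][r] = scale * (1.0 if r in (0, p - 1) else 1.0 + rho ** 2)
        if r + 1 < p:
            inv[r][r + 1] = inv[r + 1][r] = -rho * scale
    return inv


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b))) for c in range(len(b[0]))]
            for r in range(len(a))]


# Check on a small case that the tridiagonal matrix really is the inverse.
check = matmul(ar1(6), ar1_inverse(6))
```

Because every entry of `ar1_inverse` with $|i - j| > 1$ is zero by construction, a product `check` that is numerically the identity confirms the sparsity of $\Sigma^{-1}$ in Model 3 for that small case.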

For each setting in each model, we repeat the following procedure 50 times. In each
replication, we generate $n = 450$ training samples with $n/K$ samples in each class, and then
generate 450 test samples with $n/K$ samples in each class, independent of the training samples.
For each method (except the optimal rule), we use the training data to choose the
tuning parameters and find the classification rule, and then we apply the fitted classification
rule to the test data to obtain the classification errors. The average classification errors and
standard deviations over the 50 replications for all the settings in the three models are listed in
Table 1.

In all settings, SLDA2 has lower classification errors than SLDA1. Therefore, for SLDA,
first adding a diagonal matrix with small common positive diagonal entries to the thresholded
covariance matrix is better than directly calculating its generalized inverse matrix in terms
of predictive performance. For all settings in both Models 1 and 2, LPD has the smallest
classification errors compared to SLDA1, SLDA2, NSC and GLDA. In Model 3, when $p = 300$,
LPD has the smallest errors, and when $p = 600$, SLDA2 has the smallest errors. For all
three models and all the methods, the classification errors increase with the
number of classes when $p$ is fixed. Given $K$, the optimal errors are almost unchanged when $p$
increases from 300 to 600, and the classification errors of the other methods generally increase,
although those of NSC remain roughly unchanged. When
both $K$ and $p$ are large, GLDA performs much worse than the other methods and has errors
close to or more than 50%.
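The comparisons between methods can be verified mechanically from the mean errors reported in Table 1. The Python sketch below transcribes those means (standard deviations and the optimal rule are omitted) and checks two of the statements above; the dictionary layout and function name are hypothetical.

```python
METHODS = ("LPD", "SLDA1", "SLDA2", "NSC", "GLDA")

# (model, K, p) -> mean classification errors, in the order given by METHODS.
ERRORS = {
    (1, 3, 300): (0.027, 0.171, 0.067, 0.052, 0.197),
    (1, 3, 600): (0.030, 0.273, 0.083, 0.052, 0.312),
    (1, 6, 300): (0.063, 0.314, 0.150, 0.100, 0.379),
    (1, 6, 600): (0.079, 0.456, 0.167, 0.091, 0.543),
    (1, 9, 300): (0.095, 0.414, 0.211, 0.140, 0.495),
    (1, 9, 600): (0.134, 0.552, 0.251, 0.152, 0.659),
    (2, 3, 300): (0.027, 0.176, 0.066, 0.068, 0.198),
    (2, 3, 600): (0.027, 0.313, 0.083, 0.068, 0.373),
    (2, 6, 300): (0.069, 0.324, 0.144, 0.109, 0.362),
    (2, 6, 600): (0.066, 0.509, 0.189, 0.109, 0.597),
    (2, 9, 300): (0.091, 0.433, 0.231, 0.169, 0.502),
    (2, 9, 600): (0.110, 0.618, 0.292, 0.181, 0.711),
    (3, 3, 300): (0.006, 0.066, 0.020, 0.190, 0.068),
    (3, 3, 600): (0.084, 0.673, 0.034, 0.187, 0.160),
    (3, 6, 300): (0.016, 0.171, 0.057, 0.384, 0.176),
    (3, 6, 600): (0.111, 0.835, 0.091, 0.382, 0.337),
    (3, 9, 300): (0.031, 0.262, 0.098, 0.501, 0.257),
    (3, 9, 600): (0.185, 0.897, 0.170, 0.508, 0.456),
}


def best_method(row):
    """Name of the method with the smallest mean error in one setting."""
    return METHODS[min(range(len(row)), key=row.__getitem__)]


# SLDA2 versus SLDA1 in every setting.
slda2_always_better = all(row[2] < row[1] for row in ERRORS.values())

# Best method per setting, e.g. winners[(3, 9, 600)].
winners = {setting: best_method(row) for setting, row in ERRORS.items()}
```

Only the reported means are compared here; the sketch does not account for the standard deviations, so it says nothing about whether the differences are statistically meaningful.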

To evaluate the computational efficiency of the extended methods LPD, SLDA1 and
SLDA2, we list in Table 2 the average time in seconds of running one replication (including
five-fold cross-validation) for Model 1 with each of the three methods. All the computations are
conducted on a compute cluster running Red Hat Linux. The CPU in each node of the cluster
is an Intel(R) Xeon(R) CPU 5160 at 3.00GHz. All methods need longer execution time as $p$
increases with $K$ fixed. Compared to SLDA1 and SLDA2, the execution time of LPD is
much more sensitive to the number of classes. That is because the major computational load
of LPD is to solve the optimization problem (6.1) for each $1 < j \le K$, whereas the major load
of SLDA is the calculation of the inverse or generalized inverse of the thresholded
covariance matrix, with or without a small diagonal matrix added.
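The different sensitivity to $K$ can be read off the mean running times in Table 2 with a little arithmetic. The Python sketch below transcribes those means and computes, for each method and each $p$, the factor by which the time grows from $K = 3$ to $K = 9$, together with the LPD time per solved problem (6.1). The names are hypothetical, and dividing by $K - 1$ reflects only the number of problems per fit, not the cross-validation details, which the text does not give.

```python
METHODS = ("LPD", "SLDA1", "SLDA2")

# (K, p) -> mean running time in seconds, in the order given by METHODS.
TIMES = {
    (3, 300): (9.29, 19.15, 96.84),
    (3, 600): (59.42, 58.11, 309.68),
    (6, 300): (25.77, 22.28, 113.98),
    (6, 600): (157.52, 57.53, 310.29),
    (9, 300): (48.02, 21.56, 108.77),
    (9, 600): (272.59, 61.78, 323.02),
}


def growth_in_classes(method, p):
    """Ratio of the mean time at K = 9 to the mean time at K = 3."""
    idx = METHODS.index(method)
    return TIMES[(9, p)][idx] / TIMES[(3, p)][idx]


def lpd_time_per_problem(num_classes, p):
    """Mean LPD time divided by the K - 1 instances of (6.1) solved per fit."""
    return TIMES[(num_classes, p)][0] / (num_classes - 1)


growth = {(m, p): growth_in_classes(m, p) for m in METHODS for p in (300, 600)}
```

For LPD the growth factor is between 4 and 6 at both dimensions, while for SLDA1 and SLDA2 it stays below 1.2, which matches the explanation that the LPD cost scales with the number of optimization problems.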


8 Discussion 

In this paper, we aim to provide a general theory for the asymptotic optimality of the
classification rules in a large family in high-dimensional settings with an arbitrary number of
classes. Our main theorem provides easy-to-check criteria for the asymptotic optimality of the
classification rules in this family, and we establish the corresponding convergence rates as
both dimensionality and sample size go to infinity and the number of classes is arbitrary.
This general theory is applied to the classic LDA, the linear programming discriminant rule
by Cai and Liu (2011), and the sparse linear discriminant analysis rule by Shao et al. (2011).
We extend the latter two methods to the multiclass case. We establish the asymptotic
optimality of the three methods and provide the convergence rates in high-dimensional
settings with an arbitrary number of classes. Through a simulation study, we demonstrate that
the extended methods have good predictive performance when the conditions of these
methods are satisfied.


Acknowledgments 


The second author is supported by NSF DMS 1208786. 


References 

Anderson, T. (2003) An Introduction to Multivariate Statistical Analysis, Third Edition. 
Wiley Series in Probability and Statistics. Wiley. 

Bickel, P. J. and Levina, E. (2004) Some theory for Fisher's linear discriminant function, 'naive
Bayes', and some alternatives when there are many more variables than observations.
Bernoulli, 989-1010.

Bickel, P. J. and Levina, E. (2008) Covariance regularization by thresholding. The Annals 
of Statistics, 2577-2604. 

Cai, T. and Liu, W. (2011) A direct estimation approach to sparse linear discriminant 
analysis. Journal of the American Statistical Association, 106. 

Clemmensen, L., Hastie, T., Witten, D. and Ersbøll, B. (2011) Sparse discriminant analysis.
Technometrics, 53, 406-413.

Devroye, L., Györfi, L. and Lugosi, G. (2013) A probabilistic theory of pattern recognition,
vol. 31. Springer Science & Business Media.



Dudoit, S., Fridlyand, J., and Speed, T. (2001) Comparison of discrimination methods for 
the classification of tumors using gene expression data. Journal of the American Statistical 
Association, 96, 1151-1160. 

Fan, J., Feng, Y. and Tong, X. (2012) A road to classification in high dimensional space: the 
regularized optimal affine discriminant. Journal of the Royal Statistical Society: Series B 
(Statistical Methodology), 74, 745-771. 

Friedman, J. (1989) Regularized discriminant analysis. Journal of the American Statistical 
Association, 84, 165-175. 

Guo, Y., Hastie, T. and Tibshirani, R. (2007) Regularized linear discriminant analysis and 
its applications in microarrays. Biostatistics, 8, 86-100. 

Härdle, W. and Simar, L. (2012) Applied multivariate statistical analysis, Third Edition.
Springer.

Krzanowski, W., Jonathan, P., McCarthy, W., and Thomas, M. (1995) Discriminant analysis
with singular covariance matrices: Methods and applications to spectroscopic data.
Journal of the Royal Statistical Society, 44, 101-115.

Shao, J., Wang, Y., Deng, X. and Wang, S. (2011) Sparse linear discriminant analysis by 
thresholding for high dimensional data. Ann. Statist., 39, 1241-1265.

Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002) Diagnosis of multiple cancer 
types by shrunken centroids of gene expression. Proceedings of the National Academy of 
Sciences of the United States of America, 99, 6567-6572. 

Witten, D. and Tibshirani, R. (2011) Penalized classification using Fisher's linear discriminant.
Journal of the Royal Statistical Society, Ser. B, 73, 753-772.

Xu, P., Brock, G. and Parrish, R. (2009) Modified linear discriminant analysis approaches 
for classification of high-dimensional microarray data. Computational Statistics and Data 
Analysis, 53, 1674-1687. 



Table 1: The averages and standard deviations (in parentheses) of classification errors over 50
replications for all the settings in all the three models.

Model    K  p    LPD           SLDA1         SLDA2         NSC           GLDA          Optimal
Model 1  3  300  0.027(0.007)  0.171(0.027)  0.067(0.013)  0.052(0.026)  0.197(0.023)  0.023(0.006)
         3  600  0.030(0.008)  0.273(0.032)  0.083(0.012)  0.052(0.031)  0.312(0.034)  0.023(0.007)
         6  300  0.063(0.012)  0.314(0.038)  0.150(0.032)  0.100(0.042)  0.379(0.032)  0.050(0.011)
         6  600  0.079(0.016)  0.456(0.041)  0.167(0.022)  0.091(0.039)  0.543(0.029)  0.048(0.011)
         9  300  0.095(0.015)  0.414(0.040)  0.211(0.037)  0.140(0.035)  0.495(0.029)  0.071(0.013)
         9  600  0.134(0.020)  0.552(0.052)  0.251(0.047)  0.152(0.051)  0.659(0.024)  0.073(0.013)
Model 2  3  300  0.027(0.008)  0.176(0.025)  0.066(0.016)  0.068(0.042)  0.198(0.025)  0.024(0.008)
         3  600  0.027(0.007)  0.313(0.028)  0.083(0.013)  0.068(0.048)  0.373(0.026)  0.023(0.008)
         6  300  0.069(0.035)  0.324(0.040)  0.144(0.036)  0.109(0.045)  0.362(0.059)  0.047(0.016)
         6  600  0.066(0.014)  0.509(0.043)  0.189(0.053)  0.109(0.046)  0.597(0.026)  0.048(0.009)
         9  300  0.091(0.013)  0.433(0.047)  0.231(0.055)  0.169(0.056)  0.502(0.022)  0.072(0.012)
         9  600  0.110(0.017)  0.618(0.055)  0.292(0.077)  0.181(0.047)  0.711(0.024)  0.073(0.013)
Model 3  3  300  0.006(0.004)  0.066(0.016)  0.020(0.007)  0.190(0.029)  0.068(0.017)  0.002(0.002)
         3  600  0.084(0.039)  0.673(0.049)  0.034(0.010)  0.187(0.032)  0.160(0.017)  0.002(0.002)
         6  300  0.016(0.006)  0.171(0.027)  0.057(0.013)  0.384(0.025)  0.176(0.024)  0.004(0.003)
         6  600  0.111(0.062)  0.835(0.028)  0.091(0.024)  0.382(0.025)  0.337(0.025)  0.006(0.005)
         9  300  0.031(0.008)  0.262(0.029)  0.098(0.016)  0.501(0.030)  0.257(0.024)  0.008(0.004)
         9  600  0.185(0.069)  0.897(0.019)  0.170(0.046)  0.508(0.024)  0.456(0.028)  0.007(0.005)


Table 2: The average running time (in seconds) and standard deviations (in parentheses) of
one replication (including five-fold cross-validation) for Model 1.

K  p    LPD           SLDA1        SLDA2
3  300  9.29(1.12)    19.15(0.64)  96.84(2.83)
3  600  59.42(3.22)   58.11(5.84)  309.68(9.57)
6  300  25.77(2.35)   22.28(3.45)  113.98(16.91)
6  600  157.52(2.44)  57.53(4.11)  310.29(7.75)
9  300  48.02(3.42)   21.56(0.98)  108.77(2.70)
9  600  272.59(5.41)  61.78(5.56)  323.02(8.12)